home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Power Programmierung
/
Power-Programmierung (Tewi)(1994).iso
/
magazine
/
drdobbs
/
1990
/
11
/
e_floyd
/
applicat.txt
< prev
next >
Wrap
Text File
|
1990-07-24
|
8KB
|
152 lines
Applications
Spelling Checker:
With superimposed coding, we can build a spelling checker that
has some unusual properties. First, use the results of one of
the studies of word usage frequency to build a "primary"
dictionary bit table from the 10000 most frequently used English
words. At 13 bits per key, the optimal bit table will occupy
about 23k bytes and will have a false drop probability of about
1/8000. Then, partition the remaining words by first letter into
lists of 10000 or fewer words. Create a "secondary" bit table
for each partition and save it on disk. The spelling checker
program can keep the primary bit table in memory and can allocate
an auxiliary, 23k buffer for one of the secondary tables. The
program will find most words in the primary bit table. For rare
or misspelled words, the program will usually have to read a bit
table from disk into the auxiliary buffer. Still, it should be
fast enough to keep up with real-time typing. Overall, the
checker will fail to catch about 1/4000 misspelled words, the
same as Doug McIlroy's program.
Now we have a complete, conventional spelling checker with good
accuracy, but small enough (under 60k) to reside constantly in
memory, perhaps as a DOS TSR. If we allocate a primary bit table
for 12000 words (28k), the program will be capable of learning up
to 2000 new words without compromising its design target for
accuracy.
Document retrieval:
To help researchers locate information, electronic libraries
index documents and abstracts by author and subject keyword
search terms. Sometimes as many as ten search terms are
associated with a single document. Instead of a conventional
index, we can use superimposed coding to speed up the search.
Build for each document a small bit table superimposing five bits
for each index term. This bit table is sometimes called a
"surrogate" or "signature"; I'll call it a surrogate. The
surrogate may be quite small, say 128 bits. The false drop rate
for a 128-bit surrogate with ten, 5-bit keywords is less than
(50/128)@SUPERSCRIPT[5], or 1/110. Let's say we want to locate all the
documents associated with the keyword "programmer". We could
test each surrogate for the keyword as described above. However,
there's no need to recompute the hash function and bit positions
for the search keyword each time. Instead, build a 128-bit
search table, like a surrogate but with only the 5 bits "on" for
the single word "programmer". If a document surrogate has even
one bit "off" that is "on" in the search table, the document
can't possibly be associated with the keyword "programmer". We
can compare the search table with each surrogate using Boolean
logic: Perform a logical "AND" between each document's surrogate
and the search table and compare the result to the original
search table. If they are not equal, we can reject the document
immediately. If they are equal, we must investigate the documentèfurther to determine whether or not it is associated with the
search term.
1001011000101.. Surrogate for document 1
0001001001000.. Search table
───────────── AND
0001001000000.. Reject document 1
^
0010010111010.. Surrogate for document 2
0001001001000.. Search table
───────────── AND
0000000001000.. Reject document 2
^ ^
0101011011010.. Surrogate for document 3
0001001001000.. Search table
───────────── AND
0001001001000.. Investigate document 3
If we want to search for a conjunction of two terms, say
"programmer" AND "sociopath", the same technique applies. Build
a search table with bits for both terms. With each surrogate,
perform the logical "AND" and compare. This time, all ten bits
in the search table must also be "on" in the document table. The
false drop probability is now about (1/110)*(1/110), or 1/12100.
So, the search becomes more efficient as we search for more
conjoined terms.
To search for a disjunction of two terms, say "programmer" OR
"hacker", we must build two search tables and investigate further
any document whose surrogate satisfies either search. The false
drop probability, therefore, is half the probability of a
single-term search, or about 1/55. By extending these
techniques, we can construct complex queries. One shortcoming -
we can't use surrogates to speed up negation (NOT) in a query.
Fortunately, in document retrieval, negation is frequently
associated with positive conditions that restrict the set of
documents anyway.
Finally, we can save search time by employing a peculiar
"sideways" ordering of surrogate bits. Gather together all the
first bits from a group of surrogates and store them in document
order at the beginning of the surrogate file. Follow this with
all the second bits, and so on. Think of the file as columns of
surrogate bits. Now we can search the group of document
surrogates in parallel. Simply "AND" together the bit columns
corresponding to the "on" bits in the search table and the result
is a bit map whose "on" bits correspond to documents that we must
investigate further. If we block surrogates in groups of 4096
documents, each bit column will occupy 512 bytes - one sector on
many modern disk subsystems. This kind of grouping simplifies
calculations and optimizes hardware performance for bit column
retrieval. In the example fragment shown below, for instance, it
is necessary to read only the fourth, seventh, and tenth sectors
of the surrogate file to compute a bit map for the first 4k
documents. è
0 0 0 1 0 0 1 0 0 1 0 0 0... Search table
Sur | AND | AND | = Bit Map
--- V V V -------
1. 1 0 0 1 0 1 1 0 0 0 1 0 1... 0 Reject
2. 0 0 1 0 0 1 0 1 1 1 0 1 0... 0 Reject
3. 0 1 0 1 0 1 1 0 1 1 0 1 0... 1 Investigate
4. 0 1 1 0 0 0 1 1 0 0 0 0 1... 0 Reject
5. 1 0 0 1 0 0 1 0 1 1 0 0 1... 1 Investigate
. . . . .
. . . . .
Database:
We can use surrogate files instead of conventional indices in
database applications, also. For example, consider a
million-record database with fifty fields in each record. The
average size of each field is 10 bytes, so the database size is
about 500 megabytes. We want to index the database on all fifty
fields. As an optimistic estimate, a conventional index would
require about 15 megabytes for each field for a minimum total
index size of 750 megabytes. The index is larger than the
database. On the other hand, an optimal surrogate for 50 keys at
10 bits per key is about 90 bytes for a total surrogate file size
of 90 megabytes. The surrogate file is almost an order of
magnitude smaller than a conventional index. It is likely to be
slower than a conventional index for single-field queries.
However, for complex queries involving Boolean combinations of
field values, it can be considerably faster than an conventional
index. (Implementation note: Distinguish between fields by
prefixing each field value with a domain identifier before
hashing bits for the surrogate.)
When the database is fairly stable and a large proportion of
queries against the index result in a record-not-found condition,
a superimposed code dictionary can save time. For example,
consider a 10000 record database indexed on one field. A
superimposed code dictionary for 10000 keys with 10 bits per key
is about 18k bytes - small enough to hold in memory. Before
searching the index for a key, check for its existence in the
dictionary. If it's not in the dictionary, there's no need to
search the index. We will have reduced unnecessary index
searches by a factor of a thousand.